1. INTRODUCTION

The authors are to be commended for jumping 
in to describe support vector machines (SVMs), not 
an easy thing to do since the literature for SVMs
has grown at least exponentially in the last few years. 
A Google search for "support vector machines" gave 
"about 1,180,000" hits as of this writing. The au- 
thors have nevertheless made a nice selection of im- 
portant points to emphasize. As noted, SVMs were 
proposed for classification in the early 1990s by ar- 
guments like those behind Figure 1 in their paper. 
The use of SVMs grew rapidly among computer sci- 
entists, as it was found that they worked very well 
in all kinds of practical applications. The theoretical 
underpinnings that went with the original proposals
were different from those in the classical statistical
literature, for example, those related to Bayes
risk, and so had less impact in the statistical literature.
The convergence of SVMs and regularization
methods (or, rather the convergence of the "SVM 
community" and the "regularization community") 
was a major impetus in the study of the (classi- 
cal) statistical properties of the SVM. One point at 
which this convergence took place was at an Amer- 
ican Mathematical Society meeting at Mt. Holyoke 
in 1996. The speaker was describing the SVM with 
the so-called kernel trick when an anonymous per- 
son at the back of the room remarked that the SVM 
with the kernel trick was the solution to an opti- 
mization problem in a reproducing kernel Hilbert 
space (RKHS). Once it was clear to statisticians that 
the SVM can be obtained as the result of an optimization/regularization
problem in a RKHS, tools
known to statisticians in this context were rapidly 
employed to show how the SVM could be modified 
to take into account nonrepresentative sample sizes, 
unequal misclassification costs and more than two 
classes, and to show in each case that it directly tar- 
gets the Bayes risk under very general circumstances 
(see also [5, 8]). Thus, a "classical" explanation of 
why they work so well was provided. 

2. MERCER'S KERNELS AND POSITIVE 
DEFINITE FUNCTIONS 

Let $T$ be a.d.o. (any dirty old) domain and let
$K(s,t)$, $s, t \in T$, be a symmetric, positive definite
function of two variables; $K$ is said to be positive
definite if for any $n$, and any $t_1, \ldots, t_n \in T$, the
$n \times n$ matrix with $ij$th element $K(t_i, t_j)$ is nonnegative
definite. In the early SVM literature, as well as in
the present paper, the kernel is described as having
a representation $K(s,t) = \sum_{\nu=1}^{\infty} \lambda_\nu \phi_\nu(s) \phi_\nu(t)$. Here
the (nonnegative) $\lambda_\nu$ and the $\phi_\nu$ are the eigenvalues
and eigenvectors of $K$. A representation as in this
sum is sufficient for $K$ to be positive definite (see
[13] on the Mercer Hilbert-Schmidt theorem), but
the so-called radial basis functions (RBF) popular
in machine learning, of the form $K(s,t) = k(\|s - t\|)$,
$s, t$ in Euclidean $d$-space $E^d$, do not have a countable
sequence of eigenvalues and eigenvectors; complex
exponentials play the role of eigenvectors (see [3]).
The Gaussian kernel $K_c(x, y) = e^{-\|x - y\|^2/c}$ is such an
example. Although the notion of a countable expan- 
sion was used in uncoupling the linear SVM from its 
linearity restriction (and seems to be repeated over 
and over), the lack of a countable set of eigenvec- 
tors and eigenvalues does not affect the use of the 
Gaussian kernel or any other positive definite func- 
tion in an SVM; as the authors note, only values of 
K are needed. The RBF probably just do not want 
to be called "Mercer's kernels" (!). Positive definite 
functions are sometimes called reproducing kernels, 
relating to their association with RKHS [1]. 

Given a collection of objects (which could be vec- 
tors, images, sounds, graphs, texts, trees, . . . ) in 
a.d.o. domain $T$, a positive definite matrix with $ij$
entry $K(i,j)$ defines a (squared) distance $d_{ij}$ between
the $i$th and $j$th object as

$$d_{ij} = K(i,i) + K(j,j) - 2K(i,j)$$

(and, in addition, this distance comes with an in- 
ner product). It can be argued that using distance 
between objects, defined in some way, is truly funda- 
mental to classification and, therefore, positive def- 
inite kernels, since they provide a distance, play a 
fundamental role. 
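
As a minimal sketch of the kernel-induced distance described above, the following computes $K(i,i) + K(j,j) - 2K(i,j)$ for a small set of objects. The function names, the three example points and the choice $c = 1$ for the Gaussian kernel are illustrative assumptions, not taken from the paper. With the ordinary inner product as kernel, the formula reduces to the squared Euclidean distance.

```python
import math

def linear_kernel(x, y):
    """Ordinary inner product; a positive definite kernel on vectors."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, c=1.0):
    """Gaussian kernel exp(-||x - y||^2 / c); c = 1 is an arbitrary choice."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / c)

def kernel_matrix(objects, kernel):
    """n x n matrix with ij entry K(i, j)."""
    return [[kernel(a, b) for b in objects] for a in objects]

def squared_distances(K):
    """Squared distances d_ij = K(i,i) + K(j,j) - 2 K(i,j)."""
    n = len(K)
    return [[K[i][i] + K[j][j] - 2.0 * K[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical objects: three points in the plane.
objects = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
D_linear = squared_distances(kernel_matrix(objects, linear_kernel))
D_gauss = squared_distances(kernel_matrix(objects, gaussian_kernel))
```

Only values of the kernel are needed here, so the same `squared_distances` function applies to any collection of objects for which a positive definite matrix is available.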

3. LARGE MARGIN CLASSIFIERS AND 
REGULARIZATION 

Referring to equation (3.1) in the main paper, note
that the elementary cost function $(1 - y_i f(x_i))_+$ depends
only on $\tau_i = y_i f(x_i)$. If $y_i$ and $f(x_i)$ have the
same sign, then $f$ will classify $y_i$ correctly, and if
they have different signs, then $f$ will classify $y_i$ incorrectly.
The term $\tau_i$ is frequently called the margin,
and classifiers that depend on the data only
through $\tau$ are called large margin classifiers. The
cost function $c(\tau) = (-\tau)_*$, where $(\tau)_* = 1$ if
$\tau > 0$ and $0$ otherwise, is called the misclassification
counter, and it would be considered the ideal
cost function if it were not for the fact that it leads
to a nonconvex, nontractable optimization problem.
Considering Bernoulli data coded as $y_i = 1$ or $y_i = 0$,
the penalized likelihood estimate, where the cost
function is the negative log likelihood, goes back at
least to [12]. In that paper, members of the exponential
family were considered as cost functions and
it was natural to put the log likelihood in the canonical
form for distributions in the exponential family.
Thus the log likelihood for Bernoulli data is parameterized
by the logit $f(x) = \log[p(x)/(1 - p(x))]$. However,
if Bernoulli data are recoded as $y_i = \pm 1$, then
the negative log likelihood (cost function) becomes
$c(y, f) = \log(1 + e^{-yf})$. Since thresholding $p(x)$ at $p = 1/2$ is
equivalent to thresholding at $f = 0$, the penalized
log likelihood estimate is also a large margin classifier.
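
The recoding step can be checked numerically. The sketch below assumes the usual Bernoulli negative log likelihood $-[y \log p + (1 - y)\log(1 - p)]$ with $p = e^f/(1 + e^f)$ for 0/1 coding, and compares it with $\log(1 + e^{-yf})$ for $\pm 1$ coding. The function names and the grid of logit values are illustrative assumptions.

```python
import math

def nll_01(y01, f):
    """Bernoulli negative log likelihood, y coded 0/1, f the logit."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))

def nll_pm1(ypm, f):
    """The same cost written as a function of the margin y*f, y coded +1/-1."""
    return math.log(1.0 + math.exp(-ypm * f))

# Compare the two forms on a grid of logit values for both classes.
logits = [k / 4.0 for k in range(-12, 13)]
max_gap = max(abs(nll_01(y01, f) - nll_pm1(2 * y01 - 1, f))
              for y01 in (0, 1) for f in logits)
```

The two expressions agree up to rounding error, which is why the penalized likelihood estimate depends on the data only through $y f(x)$ once the labels are coded as $\pm 1$.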

It turns out that there are lots of large margin
classifiers with the property that the sign of the estimate
that minimizes

$$\frac{1}{n} \sum_{i=1}^{n} c(y_i f(x_i)) + \lambda \|f\|^2_{\mathcal{H}_K}$$

tends to the sign of the log odds ratio, assuming
that the problem is tuned adequately and that the
RKHS associated with $K$ is rich enough for the
problem at hand. The following rather amazing result
is from [6]: Let $c(z) < c(-z)$ for every $z > 0$, and let
$c'(0) \neq 0$ exist. If $E[c(Y f(X)) \mid X = x]$ has a global
minimizer $\bar{f}(x)$ and $f(x) \neq 0$, where $f(x)$ is the log
odds ratio, then $\operatorname{sign}(\bar{f}(x)) = \operatorname{sign}(f(x))$.
A bunch of examples are given in [6].
Note the result that the lowly squared difference
$c(y, f) = (y - f)^2$ leads to a large margin classifier
since if $|y| = 1$, then $(y - f)^2 = (1 - yf)^2$. This large
margin classifier (!) is sometimes called the least
squares support vector machine, but it is nothing
more than ordinary ridge regression on data that
have been coded as $\pm 1$. Many large margin classifiers
have been proposed, both convex and nonconvex,
that claim various properties; four of the
many are described in [11, 14, 17, 19]. These classifiers
are said to have some special advantages, either
theoretical, computational or practical, and it
is interesting to understand more generally the circumstances
under which one cost function can be
better than another. Considering accuracy as well as
computational tractability, it is unlikely that there
will be just one best cost function for all classification
problems (see the comparison in Figure 1). The
hinge function occupies a niche as a general purpose
large margin classifier that is the closest convex upper
bound, in some sense, to the misclassification
function.
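
The cost functions discussed in this section can be compared directly as functions of the margin $\tau = y f(x)$. The sketch below is illustrative only: the function names and the grid are assumptions, the misclassification counter is taken to be 1 for $\tau < 0$ and 0 otherwise, and the negative log likelihood is scaled to base 2 so that it passes through 1 at $\tau = 0$. It checks that the hinge, the scaled negative log likelihood and the squared difference all lie on or above the misclassification counter, and that $(y - f)^2 = (1 - yf)^2$ when $y = \pm 1$.

```python
import math

def misclassification(tau):
    """Misclassification counter: 1 if the margin is negative, else 0."""
    return 1.0 if tau < 0 else 0.0

def hinge(tau):
    """Hinge function (1 - tau)_+ used by the SVM."""
    return max(0.0, 1.0 - tau)

def neg_log_lik(tau):
    """Negative log likelihood in base 2, equal to 1 at tau = 0."""
    return math.log2(1.0 + math.exp(-tau))

def squared(tau):
    """Squared difference written as a function of the margin."""
    return (1.0 - tau) ** 2

grid = [k / 10.0 for k in range(-30, 31)]

# Each convex cost is an upper bound on the misclassification counter.
upper_bound_ok = all(cost(t) >= misclassification(t)
                     for cost in (hinge, neg_log_lik, squared)
                     for t in grid)

# With labels coded +1/-1, (y - f)^2 equals (1 - y f)^2.
squared_identity_ok = all(abs((y - f) ** 2 - (1.0 - y * f) ** 2) < 1e-12
                          for y in (-1.0, 1.0) for f in grid)
```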

4. PROBABILITY ESTIMATES AND THE 
SVM 

I respectfully disagree with the authors' remark 
that "from a statistical point of view, an impor- 
tant subject remains open: the interpretability of 
the SVM outputs." I think the appropriate interpretation
is that the SVM targets the sign of the log
odds ratio directly; see [7]. Since the target function
$\operatorname{sign} f(x)$ is discontinuous at $f(x) = 0$, and the SVM is
found as an optimization problem in a RKHS which
is typically a space of continuous functions, it cannot
jump at the boundary, but there may be a Gibbs
effect there. Since the SVM is generally a smooth approximation
which tends not to stray too far outside
of the interval $[-1, 1]$, there is a tendency to believe
that $2p - 1$ can be inferred from the SVM. This is
not, however, the case. A toy problem which is easy
to drive toward asymptopia illustrates this point.

In Figure 2, the solid line gives $2p(x) - 1$, where
$p(x)$ is the true conditional probability of the $+$
class. Data $y_i$ have been generated as $y_i = 1$ with
probability $p(x_i)$ and $-1$ with probability $1 - p(x_i)$
for 300 equally spaced points $x_i$ in the interval
$[-2, 2]$. The logit $f(x)$ has been estimated
via penalized likelihood, and the resulting $2p(x) - 1$ is plotted as
the dotted line, where $p = e^{f}/(1 + e^{f})$. The dashed
line gives the support vector machine estimate from
the same data. It can be seen that the SVM is trying
to estimate $-1$ for $x < 0$ and $+1$ for $x > 0$, which
is the Bayes optimal classifier here. A small Gibbs
effect near the class boundary $x = 0$ is evident, although
the penalized likelihood and SVM will essentially
pick out the same classification boundary.
Further examples of this phenomenon in the context
of the multicategory SVM of Lee, Lin and Wahba
can be found in [5]. A comparative discussion of the
multicategory SVM and a multicategory penalized
likelihood estimate can be found in [16].

Fig. 2. Solid line: $2p(x) - 1$, where $p(x)$ is the true conditional
probability that $Y = 1$; dashed line: fitted SVM; dotted
line: fitted penalized likelihood estimate. Data $y_i$ have been
generated according to $p(x)$ for 300 equally spaced values of
$x$.
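
The difference between the two targets can be seen at the population level, without fitting anything in a RKHS. The sketch below is an illustration under stated assumptions rather than a reproduction of the toy problem: for a fixed conditional probability $p$, it minimizes the expected cost $p\,c(f) + (1 - p)\,c(-f)$ over a grid of values $f$, once for the hinge function and once for the negative log likelihood. The function names, the grid and the value $p = 0.7$ are hypothetical choices.

```python
import math

def hinge(t):
    """Hinge cost (1 - t)_+."""
    return max(0.0, 1.0 - t)

def logistic(t):
    """Negative log likelihood cost log(1 + exp(-t))."""
    return math.log(1.0 + math.exp(-t))

def population_minimizer(cost, p, lo=-5.0, hi=5.0, steps=1000):
    """Grid minimizer of the expected cost p*c(f) + (1-p)*c(-f)."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda f: p * cost(f) + (1.0 - p) * cost(-f))

p = 0.7
f_hinge = population_minimizer(hinge, p)        # close to sign(2p - 1)
f_logit = population_minimizer(logistic, p)     # close to log(p / (1 - p))
p_from_logit = 1.0 / (1.0 + math.exp(-f_logit)) # recovers p approximately
```

Under these assumptions the hinge minimizer sits at the sign of $2p - 1$ and carries no further information about $p$, whereas the minimizer for the negative log likelihood is the logit, from which $p$ can be recovered.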

5. SUPPORT VECTOR REGRESSION 

A precursor of the $\varepsilon$-insensitive loss function can
be found in [15], where the loss function is $L(y, f(x)) = 0$
if $|y - f| < \varepsilon$ and $\infty$ otherwise. In 1969 only highly
quantized data were available from satellites, but
computation of such estimates was iffy.
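
For comparison, the precursor loss can be set beside the $\varepsilon$-insensitive loss in its standard form $\max(0, |y - f| - \varepsilon)$. That standard form is supplied here as an assumption, since it is not written out in this comment, and the function names and example values are illustrative.

```python
import math

def hard_band_loss(y, f, eps):
    """Precursor loss: 0 inside the band |y - f| < eps, infinite outside."""
    return 0.0 if abs(y - f) < eps else math.inf

def eps_insensitive_loss(y, f, eps):
    """Standard epsilon-insensitive loss: linear growth outside the band."""
    return max(0.0, abs(y - f) - eps)

# Both losses vanish inside the band; they differ only outside it.
inside = (hard_band_loss(1.0, 1.05, 0.1), eps_insensitive_loss(1.0, 1.05, 0.1))
outside = (hard_band_loss(1.0, 1.5, 0.1), eps_insensitive_loss(1.0, 1.5, 0.1))
```

The precursor loss acts as a hard constraint on the fit, while the $\varepsilon$-insensitive loss tolerates points outside the band at a linear cost.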

6. SPARSITY, VARIABLE SELECTION 

In many classification problems, it is desirable to 
learn which components of the proposed attribute 
vector are actually contributing substantially to the 
actual classification. Two recent contributions are
[4] and [18]. The trick is to add $l_1$ (absolute value)
penalties on coefficients of variables or terms in the
penalty functional, which induces sparsity, as is well
known. An early proponent of adding $l_1$ penalties in
classification algorithms to promote sparsity is [2];
there are many other recent related contributions.
In practice the major challenge in many problems
involves which attributes, or clusters of attributes,
to put into the model to begin with. This challenge
appears in images, sounds, handwriting, text, genomic
data, meteorological data, astronomical data
and elsewhere. Many open questions remain in particular
contexts.

Fig. 1. Comparison of the cost functions $c(\tau) = (-\tau)_*$, $c(\tau) = (1 - \tau)_+$ and $c(\tau) = \log_2(1 + e^{-\tau})$, which are the misclassification
function (equal to 1 when $\tau < 0$ and 0 otherwise), the hinge function and the negative log-likelihood function, respectively. Any strictly convex function that goes
through 1 at $\tau = 0$ will be an upper bound on the misclassification function $(-\tau)_*$ and will be a looser bound than some hinge
function $(1 - \theta\tau)_+$.

7. REGULARIZED KERNELS FROM 
DISSIMILARITY DATA 

Some recent work [9] focused on fitting kernels
from noisy, scattered, incomplete dissimilarity data,
which can then be used as a dimension reduction
tool or in a SVM or multicategory SVM. Given a set
of objects (protein sequences in [9]) and dissimilarity
information $d_{ij}$ between the $i$th and $j$th object, for a
sufficiently rich subset $\Omega$ of the $\binom{n}{2}$ pairs, one finds
an $n \times n$ kernel (nonnegative definite matrix) $K_\mu$
over "object space" to yield

$$\min_{K \in S} \sum_{(i,j) \in \Omega} |d_{ij} - \hat{d}_{ij}| + \mu \operatorname{trace} K, \qquad (1)$$

where $S$ is the class of nonnegative definite $n \times n$
matrices and $\hat{d}_{ij} = K_\mu(i,i) + K_\mu(j,j) - 2K_\mu(i,j)$. It
is necessary to choose $\mu$ and it is useful to truncate
the eigenvalues of $K_\mu$ after the first $p$, where $p$ can
be chosen so as to retain some specified percentage
of the trace. Suppressing $\mu$ and the truncation level
$p$, a support vector machine $f(i)$, $i = 1, \ldots, n$, can
be defined in object space as

$$f(i) = \sum_{\ell=1}^{n} c_\ell K(i, \ell)$$

by minimizing

$$\sum_{i=1}^{n} (1 - y_i f(i))_+ + \lambda c' K c$$

or its multicategory analog from [5].
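
The truncation rule mentioned above, keeping the first $p$ eigenvalues so as to retain a specified percentage of the trace, can be sketched as follows. The function name, the example eigenvalues and the handling of small negative values are illustrative assumptions; the eigenvalues themselves would come from an eigendecomposition of the fitted kernel, which is not shown.

```python
def truncation_level(eigenvalues, fraction):
    """Smallest p whose p largest eigenvalues retain `fraction` of the trace.

    Negative values (e.g. rounding error in a nonnegative definite
    matrix) are clipped to zero before the trace is computed.
    """
    vals = sorted((max(v, 0.0) for v in eigenvalues), reverse=True)
    total = sum(vals)
    if total == 0.0:
        return 0
    running = 0.0
    for p, v in enumerate(vals, start=1):
        running += v
        if running >= fraction * total:
            return p
    return len(vals)

# Hypothetical spectrum with trace 10.
eigs = [5.0, 3.0, 1.0, 0.5, 0.5]
p_75 = truncation_level(eigs, 0.75)
p_85 = truncation_level(eigs, 0.85)
```

Because the trace of a nonnegative definite matrix is the sum of its eigenvalues, the retained percentage is simply the ratio of the partial sum to the total.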

To classify a new object ($i = n + 1$), the "newbie"
algorithm is used. It goes as follows: Given $d_{i,n+1}$
for sufficiently many $i$, find $b \in E^n$ and constant $c$
to minimize

$$\sum_i |d_{i,n+1} - \hat{d}_{i,n+1}|$$

over $b \in \operatorname{range}(K)$ and $c - b'K^{\dagger}b \geq 0$. The $b$ and $c$
are used to give a new $(n+1) \times (n+1)$ nonnegative
definite matrix with $K$ in the upper left block, and
$K(n+1, n+1) = c$, $K(i, n+1) = b_i$, $i = 1, \ldots, n$, and
$\hat{d}_{i,n+1} = K(n+1, n+1) + K(i,i) - 2K(i, n+1)$. Then
the classifier evaluated at the $(n+1)$st object is

$$f(n+1) = \sum_{\ell=1}^{n} c_\ell K(n+1, \ell).$$

Pseudo-attribute vectors may be defined as $x(i) =
(\sqrt{\lambda_1}\phi_1(i), \ldots, \sqrt{\lambda_p}\phi_p(i))$, where the $\{\lambda_\nu, \phi_\nu\}$ are the
eigenvalues and eigenvectors of $K$. The newbie can
be placed in this pseudo-attribute coordinate system
by using its fitted distance from a sufficiently
large subset of the fitted training set distances. Since
$K(i,j) = \langle x(i), x(j) \rangle$, the resulting SVM is linear in
the pseudo-attribute vectors. However, other SVMs
can be built on the labeled pseudo-attribute vectors.

The so-called semisupervised version of this prob- 
lem occurs when only some of the original training 
objects are labeled. Thus, there are three kinds of 
objects: (1) those that are in the training set and la- 
beled; (2) those that are in the training set and not 
labeled, but are used to determine the geometry of 
the object space; and (3) unlabeled newbies. Both 
kinds of unlabeled data can then be classified by the 
SVM. 

The tuning parameter $\mu$ in equation (1) can be
tuned by leaving out pairs of objects (CV2) and
comparing their observed distances with their fitted
(pseudo-attribute) distances for a range of $\mu$.

8. ROBUST MANIFOLD UNROLLING 

A related problem occurs when the objects of in- 
terest are believed to lie in a low-dimensional (non- 
linear) manifold in some higher-dimensional space. 
Here then it is desired to "flatten out" the mani- 
fold and reduce the dimension before carrying out 
a classification or regression operation. Recent ref- 
erences can be found in [10], where we proposed an 
approach related to that in equation (1) with two 
modifications: (a) only distances between $k$ nearest
neighbors will be used and (b) $\mu \operatorname{trace} K$ is replaced
by $-\mu \operatorname{trace} K$. The effect on the resulting pseudo-attribute
vectors is that they tend to "flatten out" or
"unroll" due to the fact that only nearest neighbor 
distances are used, as well as the fact that the minus 
sign propels distant objects to become more distant. 
A longer discussion of the rationale behind this algorithm
and demonstrations of its behavior are found
in [10]. The semisupervised version of this problem
can be defined similarly, with many potential applications.
Both of these optimization problems can be
solved numerically using convex cone optimization 
code. 

9. WHERE ARE WE GOING? 

The theory, computation and application of classification
problems that relate to support vector machines
and other regularization based classifiers are by
no means finished work, although the extent of work
so far is breathtaking. Many problems remain. Using
subject matter knowledge to build kernels that embody
subject matter information efficiently in various
fields remains an interesting challenge. For example,
text and language processing have interesting
problems that involve complex relationships between
components of text. Huge attribute vectors
and small training sets as occur in genetic data of
various kinds present their own challenges, as does
the merging of heterogeneous kinds of information.
Multiple correlated inputs and outputs provide challenges.
Improved systematic ways to choose important
attributes or groups of attributes remain to be
found. As the authors note, the relationships between
statistical learning theory based on Vapnik-Chervonenkis
dimension and SVM theory based on
regularization remain to be understood better, as
do regularization based approaches and other approaches
to classification. Collaboration between statisticians,
computer scientists, mathematicians and subject
matter experts will no doubt be needed for many
of the practical challenges.